Automatic annotation of incomplete and scattered bibliographical references in Digital Humanities papers

Authors

  • Young-Min Kim
  • Patrice Bellot
  • Elodie Faath
  • Marin Dacos
Abstract

In this paper, we address the problem of extracting and processing useful information from bibliographic references in Digital Humanities (DH) data. We present our ongoing project BILBO, supported by a Google Grant for Digital Humanities, which includes the constitution of proper reference corpora and the construction of an efficient annotation model using several appropriate machine learning techniques. A Conditional Random Field (CRF) is used as the basic approach to automatic annotation of reference fields, and a Support Vector Machine (SVM) with a set of newly proposed features is applied for sequence classification. A number of experiments are conducted to find one of the best feature settings for the CRF model on these corpora.

RÉSUMÉ. Extracting bibliographic information from unstructured text remains an open problem, which we address with machine learning approaches in the field of Digital Humanities. In this article we present the BILBO project, supported by a Google Digital Humanities Award with the support of the ANR CAAS project: the constitution of 3 reference corpora corresponding to three locations of references, the development of an annotation model, and then evaluation. Conditional random fields (CRFs) are used to annotate bibliographical references, and support vector machines (SVMs) to identify references within the text. Numerous experiments are conducted to determine the best features to be exploited by the numerical models.
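As a rough illustration of the kind of token-level features a CRF-based reference annotator might consume, the sketch below tokenizes one reference string and builds a feature dictionary per token. It is a minimal sketch under stated assumptions: the function names (`token_features`, `reference_to_features`), the feature set, and the sample reference are all hypothetical and are not taken from BILBO or from the paper's actual feature settings.

```python
import re


def token_features(tokens, i):
    """Return a feature dict for tokens[i].

    The feature set is a hypothetical example, not the paper's.
    """
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "is_digits": tok.isdigit(),
        # Four-digit numbers from 1500 to 2099 are treated as candidate years.
        "looks_like_year": bool(re.fullmatch(r"(1[5-9]|20)\d{2}", tok)),
        "is_punct": bool(re.fullmatch(r"[^\w\s]+", tok)),
        # Neighbouring tokens give the sequence model local context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def reference_to_features(reference):
    """Split a reference string into word and punctuation tokens,
    then compute one feature dict per token."""
    tokens = re.findall(r"\w+|[^\w\s]", reference)
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


# Made-up reference string, used only to exercise the functions.
tokens, feats = reference_to_features(
    "Kim, Y.-M. (2012). Automatic annotation. Paris."
)
```

A sequence labeller such as a CRF would take one such list of feature dictionaries per reference, paired with a field label (author, title, date, and so on) for each token. Which features actually help is the empirical question the abstract's experiments are about.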

Similar resources

Annotated Bibliographical Reference Corpora in Digital Humanities

In this paper, we present new bibliographical reference corpora in digital humanities (DH) that have been developed under a research project, Robust and Language Independent Machine Learning Approaches for Automatic Annotation of Bibliographical References in DH Books supported by Google Digital Humanities Research Awards. The main target is the bibliographical references in the articles of Rev...

Bilbo-Val: Automatic Identification of Bibliographical Zone in Papers

In this paper, we present the automatic annotation of the bibliographical reference zone in papers and articles in XML/TEI format. Our work proceeds in two phases: first, we use machine learning technology to classify bibliographical and non-bibliographical paragraphs in papers, by means of a model that was initially created to differentiate between the footnotes containing or not containi...

Fuzzy Neighbor Voting for Automatic Image Annotation

With the rapid development of digital images and the availability of imaging tools, massive numbers of images are created. Therefore, efficient management and suitable retrieval, especially by computers, is one of the most challenging fields in image processing. Automatic image annotation (AIA) refers to attaching words, keywords or comments to an image or to a selected part of it. In this paper,...

Expanding a Humanities Digital Library: Musical References in Cervantes' Works

Digital libraries focused on developing humanities resources for both scholarly and popular audiences face the challenge of bringing together digital resources built by scholars from different disciplines and subsequently integrating and presenting them. This challenge becomes more acute as libraries grow, both in terms of size and organizational complexity, making the traditional humanities pr...

Exploring Regional Development of Digital Humanities Researches: A Case Study for Taiwan

This study analyzed references and source papers of the Proceedings of 2009-2012 International Conference of Digital Archives and Digital Humanities (DADH), which was held annually in Taiwan. A total of 59 sources and 1,104 references were investigated, based on descriptive analysis and subject analysis of library practices on cataloguing. Preliminary results showed historical materials, events...

Publication date: 2012